COVID-19 patient profiles over four waves in Barcelona metropolitan area: A clustering approach

Objectives: To identify profiles of hospitalized COVID-19 patients and explore their association with different degrees of severity of COVID-19 outcomes (i.e., in-hospital mortality, ICU assistance, and invasive mechanical ventilation). The findings of this study could inform the development of multiple care intervention strategies to improve patient outcomes.
Methods: Prospective multicentre cohort study during four waves of COVID-19, from March 1st, 2020 to August 31st, 2021, in four health consortiums within the southern Barcelona metropolitan region. From an initial set of 292 variables covering demographic characteristics, comorbidities, vital signs, severity scores, and clinical analytics at hospital admission, we used both clinical judgment and supervised statistical methods to reduce them to the 36 most informative complete covariates according to the disease outcomes for each wave. Patients were then grouped using an unsupervised semiparametric method (KAMILA). Results were interpreted by consensus between the clinical and statistician teams to identify clinically meaningful patient profiles.
Results: The analysis included n_w1 = 1657, n_w2 = 697, n_w3 = 677, and n_w4 = 787 hospitalized COVID-19 patients for each of the four waves. Clustering analysis identified 2 patient profiles for waves 1 and 3, and 3 profiles for waves 2 and 4. Patients allocated to those groups showed different percentages of disease outcomes (e.g., wave 1: 15.9% (Cluster 1) vs. 31.8% (Cluster 2) for in-hospital mortality rate). The main factors determining the groups were patient age, obesity, number of comorbidities, oxygen support requirement, and various severity scores. The last wave was also influenced by the massive incorporation of COVID-19 vaccines.
Conclusion: Our study suggests that a single care model at hospital admission may not meet the needs of hospitalized COVID-19 adults. A clustering approach appears to be appropriate for helping physicians to differentiate patients and, thus, apply multiple care intervention strategies, as another way of responding to new outbreaks of this or future diseases.


Introduction
Since the first case of infection with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus that causes COVID-19, was reported in Wuhan [1], the disease has spread rapidly around the world. The World Health Organization (WHO) declared a global pandemic in March 2020. In Catalonia (Spain), the first case was detected in February 2020, and there have been six declared waves since then, with 1.3 million cases and more than 24 thousand deaths [2].
The entire population is generally susceptible to the virus, and the presentation of COVID-19 is diverse, ranging from asymptomatic infection to severe pneumonia or even death. Thus, there has been a lot of interest in identifying COVID-19 patient profiles and their associated risk factors to determine the worst prognosis, using as endpoints admission to the intensive care unit (ICU), the requirement of invasive mechanical ventilation (IVM), and death [3].
Clustering techniques are among the most well-known unsupervised learning methods. They have been successfully applied in medical research [4][5][6] and allow us to reveal diverse group profiles and their patterns of association among factors [7]. A clustering approach can help us both to predict the outcome of COVID-19 in patients and to reveal the set of key variables that clinicians could consider when evaluating patients and making early decisions in future outbreaks [8].
There has been research focused on a single outcome of COVID-19 and the prevalence of specific symptoms [9][10][11]. The most studied characteristics as risk factors are age, comorbidities, and clinical analytics [12][13][14][15][16]. Additionally, research using prediction models for diagnosis and prognosis is common [17,18]. To the best of our knowledge, determining patient profiles involving the combination of relevant comorbidities, clinical results, and sociodemographic features and their relationship with relevant disease outcomes such as the requirement of IVM, ICU assistance, and in-hospital mortality has been less studied [8,19-23].
We hypothesize that hospitalized COVID-19 patients can be grouped into profiles with different characteristics associated with clinical outcomes. Thus, our research aims to study those characteristics for hospitalized COVID-19 patients in a Barcelona metropolitan region via a clustering method, to identify specific patient profiles across four waves and their association with relevant disease outcomes.

a. Setting and participants
We performed a prospective multicentre cohort study during four waves of COVID-19: the first wave included hospitalized COVID-19 patients between March 1st and April 15th, 2020; the second wave, from October 1st to November 30th, 2020; the third, from January 1st to February 28th, 2021; and the fourth, from July 1st to August 31st, 2021. The study took place in four health consortiums (Hospital Universitari de Bellvitge, Consorci Sanitari de l'Alt Penedès I Garraf, Hospital de Viladecans, and Hospital de Sant Boi de Llobregat) located in the Barcelona metropolitan south region (eFig 1 in S1 File depicts that geographical region), covering a population of 1,370,709 inhabitants (eTable 1 of the S1 File shows the number of patients included in the study broken down by health consortium and wave). All patients were adults (>18 years old) and were admitted with a PCR-proven SARS-CoV-2 infection that occurred before admission, with a maximum of 48 hours between the positive test and admission.
The study was approved by the ethics committee (CEIm Hospital Universitari de Bellvitge) in accordance with Spanish legislation and was performed in accordance with the Helsinki Declaration of 1964. The need for patient informed consent was waived by the ethics committee.

b. Analytic process
Our aim was to identify distinct groups of hospitalized COVID-19 patients with different characteristics associated with clinical outcomes via a 2-step procedure: 1) a supervised method (classification tree) to select the most relevant factors while avoiding redundancy, and 2) an unsupervised semiparametric method for clustering mixed-type data (KAMILA, KAy-means for MIxed LArge data [24]). We implemented the following four steps to achieve this goal:
1) Collecting and formatting of study data. Study data were collected and managed using REDCap [25,26]. The questionnaires were designed by joint agreement of the clinical and statistician teams and filled in by physicians in each hospital. For each wave, the information collected for each hospitalized COVID-19 patient is essentially the same and includes a diverse range of features at admission, such as demographic characteristics (e.g., gender and age), the prevalence of comorbidities (e.g., smoking and Charlson index), previous treatments (e.g., statins and corticosteroids), infection symptoms and clinical exploration (e.g., oxygen saturation, respiratory frequency, and temperature), analytics (e.g., leucocytes and neutrophils), and severity scores (pneumonia severity index (PSI), pneumonia severity score (CURB-65), and viral pneumonia mortality score (MuLBSTA)). The last wave (from July 1st to August 31st, 2021) also includes information about vaccination (e.g., vaccination dose received: none, partial, or full regimen). Additionally, meaningful outcomes of COVID-19 such as the requirement of IVM, ICU assistance, and in-hospital mortality were recorded during the admission period. Finally, information about patients' ceiling of care was included [27], defined as the maximum therapeutic effort to be offered to a patient based on their age, their associated comorbidities, and the expected clinical benefit in relation to the availability of resources.
All data were merged, including an indicator variable for the wave. After data collection was completed, an exhaustive data recovery process was performed to reduce missing information and to correct possible capture errors made during data entry.
2) Guided variable selection and reduction. Decisions about variables were made by consensus between the clinical and statistician teams. Variables were dichotomized using common clinical thresholds. From the initial set of 292 variables, we used CART [28] to identify the subset of variables with the highest relative influence in discriminating the disease outcomes for each wave. The results were also validated with logistic and Cox stepwise regression with bootstrapping, and all results were consistent. From that selection, we ranked the variables by their percentage of missingness to determine the definitive set of variables used to identify clusters. Thus, we used the 36 complete variables listed in eTable 2 of the S1 File. All 36 variables were used to determine the patient clusters. However, Tables 1 and 2 only display the 19 statistically significant variables, along with gender for demographic interest, resulting in a total of 20 variables.
3) Determination of patients' clusters. We applied the KAMILA (KAy-means for MIxed LArge data) method, a semiparametric clustering approach for mixed-type data that combines the contribution of continuous and categorical variables without strong parametric assumptions. Several studies have demonstrated KAMILA's superiority over other methods in handling high imbalances between continuous and categorical data [29]. The KAMILA algorithm is a development of the k-means method that combines Gaussian-multinomial mixture models [30], a model-based approach, with the k-means algorithm [31,32], a non-model-based approach, both of which have been adapted successfully to very large data sets [33,34]. In the KAMILA procedure, each parameter for the continuous variables is initialized by a random draw from a uniform distribution with bounds set to the minimum and maximum of the corresponding continuous variable. The parameters for the categorical variables are initialized by drawing from a Dirichlet distribution with shape parameters all equal to one. The algorithm is initialized multiple times. For each initialization, the algorithm iterates until it reaches a pre-specified maximum number of iterations or until the cluster membership remains unchanged from the previous iteration, whichever occurs first. In our study, the number of clusters was determined by analyzing models with 2 to 10 groups. KAMILA has been applied in recent health studies [35][36][37][38] to stratify individuals into clusters according to the outcomes of interest.
4) Interpretation of cluster profiles. We used a 3-fold strategy to derive clinical and sociodemographic meaning from the resulting clustering structures. Firstly, we compared the patient groups obtained with KAMILA for the first wave with those obtained by applying the k-means algorithm [39], which uses only numerical variables. Secondly, we compared the groups among the four COVID-19 waves to investigate patterns of commonality and dissimilarity. Thirdly, we presented the clustering structures resulting from the statistical analysis to physicians who are part of the DIVINE research team and worked with them to define clinically meaningful patient profiles. During this procedure, we discussed the medical interpretation and contextualization of the results, and reached a consensus on the definitions.
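The initialization described in step 3 can be illustrated with a minimal Python sketch. It is not the kamila package code: the function names, the toy data, and the choice of two clusters are assumptions made for illustration only. It draws starting values for the continuous variables uniformly between each variable's observed minimum and maximum, and draws starting category probabilities from a Dirichlet distribution with all shape parameters equal to one (sampled here as Gamma(1, 1) draws divided by their sum).

```python
import random

def init_continuous_centroids(data, n_clusters, rng):
    """One starting centroid per cluster; each coordinate is drawn uniformly
    between the observed minimum and maximum of that continuous variable."""
    n_vars = len(data[0])
    lows = [min(row[j] for row in data) for j in range(n_vars)]
    highs = [max(row[j] for row in data) for j in range(n_vars)]
    return [[rng.uniform(lows[j], highs[j]) for j in range(n_vars)]
            for _ in range(n_clusters)]

def init_categorical_probs(n_levels, n_clusters, rng):
    """Level probabilities per cluster drawn from Dirichlet(1, ..., 1),
    obtained by normalizing independent Gamma(1, 1) draws."""
    probs = []
    for _ in range(n_clusters):
        draws = [rng.gammavariate(1.0, 1.0) for _ in range(n_levels)]
        total = sum(draws)
        probs.append([d / total for d in draws])
    return probs

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical toy data (not study data): (age, SatO2/FiO2) for five patients.
toy = [[65.0, 350.0], [78.0, 210.0], [43.0, 440.0], [52.0, 233.0], [70.0, 300.0]]
centroids = init_continuous_centroids(toy, n_clusters=2, rng=rng)
# Hypothetical categorical variable with three levels (e.g. none/partial/full).
level_probs = init_categorical_probs(n_levels=3, n_clusters=2, rng=rng)
```

In practice this initialization would be repeated several times, as the text notes, with the best-fitting run retained.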

c. Statistical analysis
All statistical analyses and graphics were carried out using the statistical software R version 4.0.2 (R Project for Statistical Computing).
Classification trees and bootstrapping. The importance of the variables with respect to the relevant dichotomous disease outcomes (in-hospital mortality, ICU assistance, and invasive mechanical ventilation) was assessed using a classification tree methodology (CART [28]) based on Gini's impurity index as the splitting criterion. CART was implemented with the R package rpart [40].
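To make the splitting criterion concrete, the following Python sketch computes the Gini impurity of a binary outcome and the impurity decrease obtained by splitting on one dichotomized variable. It is a simplified illustration, not the rpart implementation, whose variable importance measure is more elaborate. The variable names (`low_safi`, `sex_male`) and the toy values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 outcome labels: 2 * p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def gini_decrease(feature, outcome):
    """Impurity decrease from splitting the sample on a dichotomized (0/1) feature."""
    left = [y for x, y in zip(feature, outcome) if x == 0]
    right = [y for x, y in zip(feature, outcome) if x == 1]
    n = len(outcome)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(outcome) - weighted

# Hypothetical toy data (not study data): 1 = outcome occurred.
outcome = [0, 0, 0, 0, 1, 1, 1, 1]
candidates = {
    "low_safi": [0, 0, 0, 0, 1, 1, 1, 1],  # separates the outcome perfectly
    "sex_male": [0, 1, 0, 1, 0, 1, 0, 1],  # carries no information in this toy sample
}
# Rank candidate variables by how much a single split on them reduces impurity.
ranking = sorted(candidates,
                 key=lambda name: gini_decrease(candidates[name], outcome),
                 reverse=True)
```

A variable that yields a larger impurity decrease discriminates the outcome better, which is the intuition behind ranking variables by their relative influence.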
KAMILA. The KAMILA method was applied using the kamila [41] R package. The clustering tendency of the data (i.e., its clusterability) was assessed using Hopkins' statistic [42]. The optimal number of clusters was selected according to the prediction strength method [43]. The prediction strength indicates the proportion of pairs of cases classified in the same cluster in both the training and test samples, considering the cluster with the worst performance. Since a separate test sample is often not available, the procedure employs repeated two-fold cross-validation, in which the first fold serves as the training sample and the second fold as the test sample. Class sample sizes were also considered, since inadequate sample sizes can lead to convergence problems.
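The prediction strength computation for one train/test split can be sketched in Python as follows. This is an illustrative reading of the definition given above, not the kamila package code. It assumes two labelings of the same test cases are already available: `test_labels`, from clustering the test fold on its own, and `train_based_labels`, from assigning the test cases to the clusters learned on the training fold. Both names and the toy labels are hypothetical.

```python
from itertools import combinations

def prediction_strength(test_labels, train_based_labels):
    """Minimum, over test clusters, of the share of within-cluster pairs
    that the training-based assignment also places in the same cluster."""
    worst = 1.0
    for cluster in set(test_labels):
        members = [i for i, lab in enumerate(test_labels) if lab == cluster]
        pairs = list(combinations(members, 2))
        if not pairs:
            continue  # a singleton cluster has no pairs to check
        together = sum(train_based_labels[i] == train_based_labels[j]
                       for i, j in pairs)
        worst = min(worst, together / len(pairs))
    return worst

# Hypothetical labels for six test cases: the last case is assigned differently
# by the training-based rule, so only 1 of 3 pairs in test cluster 1 agrees.
ps = prediction_strength([0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 0])
```

Because only co-membership of pairs is compared, the measure does not depend on how the clusters are numbered. In a repeated two-fold cross-validation, this quantity would be computed for each split and each candidate number of clusters, then averaged.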
Post-hoc analysis. For each clustering structure, differences in categorical variables between clusters were assessed using Pearson's Chi-squared test, and continuous variables were compared using the Mann-Whitney U test (two-cluster solutions) or the Kruskal-Wallis test with pairwise comparisons (solutions with more than two clusters). A two-sided p-value < 0.05 was considered statistically significant. Finally, results across waves were compared for consistency.
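As an illustration of the categorical comparison, the Python sketch below computes Pearson's Chi-squared statistic and its p-value for a 2x2 table (two clusters by a binary variable). It is a minimal sketch that applies no continuity correction and uses hypothetical counts, not study data. With one degree of freedom, the upper-tail probability can be written with the complementary error function.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) and p-value
    for the 2x2 table [[a, b], [c, d]], which has 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: rows are clusters, columns are (event, no event).
stat, p_value = chi2_2x2(20, 80, 40, 60)
significant = p_value < 0.05  # two-sided threshold used in the text
```

For tables with more than two clusters or more than two categories, the same statistic generalizes to the sum of (observed - expected)^2 / expected over all cells, with a correspondingly larger number of degrees of freedom.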

KAMILA results
The optimal number of patient groups identified was two for waves 1 and 3 and three for waves 2 and 4. The prediction strength values for the KAMILA clustering in the four waves are depicted in eFig 2 of the S1 File. Generally speaking, the variables that were found to be most important in determining the clusters for the first three waves included the patient's age, obesity, the number of comorbidities, the requirement for oxygen support, and various severity scores such as SaFi illness [45,46]. SaFi illness is defined as the ratio between the oxygen saturation and the fraction of inspired oxygen (i.e., SatO2/FiO2). For the last wave, the vaccination dose received by the patient was the most determining variable.
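The SatO2/FiO2 ratio defined above is simple to compute. The short Python sketch below assumes oxygen saturation is expressed as a percentage and FiO2 as a fraction between 0.21 (room air) and 1.0. This unit convention is an assumption, chosen because it yields values on the scale reported in the results. The function name and the example readings are hypothetical.

```python
def safi_ratio(sat_o2_percent, fio2_fraction):
    """SatO2/FiO2 with saturation in percent and FiO2 as a fraction (0.21 to 1.0)."""
    if not 0.21 <= fio2_fraction <= 1.0:
        raise ValueError("FiO2 must be a fraction between 0.21 and 1.0")
    if not 0.0 < sat_o2_percent <= 100.0:
        raise ValueError("SatO2 must be a percentage between 0 and 100")
    return sat_o2_percent / fio2_fraction

# Hypothetical readings: the same saturation on room air and on supplemental oxygen.
room_air = safi_ratio(94.0, 0.21)
on_oxygen = safi_ratio(94.0, 0.40)
```

The same saturation achieved with more supplemental oxygen gives a lower ratio, which is why lower SatO2/FiO2 values indicate worse respiratory status.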

Patient profiles
Tables 1 and 2 show the resulting clustering structures for waves 1 and 4, respectively (see eTables 3 and 4 of the S1 File for waves 2 and 3, respectively), and eFig 4 in S1 File depicts a graphical comparison of sociodemographic and clinical characteristics between KAMILA clusters for each wave. In wave 1 (Table 1), 74.0% of patients (n = 1226) were grouped into Cluster 1, which contains the patients with the best prognosis (see Fig 1 and Table 3). Those patients are characterized by lower age (median [IQR]: 65 [53-76] (Cluster 1) vs. 70 [57-79] (Cluster 2)), lower and less dispersed FiO2 values, larger SatO2/FiO2 values, and, consistently, fewer age-related comorbidities and lower scores in pneumonia-related measures (PSI, CURB-65, and MuLBSTA). On the other hand, the patients assigned to Cluster 2 present more aging-related diseases such as dementia, chronic pulmonary pathology, heart failure, hypertension, degenerative neurological disease, and peripheral vascular diseases. This cluster also includes more men and a greater number of obese patients.
In wave 2 (see eTable 3 in S1 File), the patients were divided into three clusters. Cluster 3 had a higher percentage of men, more obese patients, and worse prognostic indicators (see Fig 1). Cluster 1 included younger patients with fewer comorbidities, while the opposite was found for patients in Cluster 2.
For patients in wave 3 (see eTable 4 in S1 File), two clusters were identified. There were no significant differences in the number of comorbidities (Charlson index) between the clusters, but Cluster 1 contained patients with worse health outcomes, lower SatO2/FiO2 levels, older age, higher pneumonia scores, and more severe pneumonia symptoms.
In wave 4 (Table 2), we observe a clear vaccination effect in the classification of the patients among clusters. Cluster 3 grouped the elderly patients (median [IQR]: 78.5 [67, 86]), most of whom were fully vaccinated (86.8%), in contrast with the other two clusters (Clusters 1 and 2 present the largest percentages of non-vaccinated patients: 74.3% and 78.0%, respectively). Between the two less-vaccinated groups, patients allocated to Cluster 2 presented worse health outcomes and a low rate of full-regimen vaccination (16.6%) combined with a wider age range (median [IQR]: 52 [39,65]). Furthermore, patients in Cluster 2 presented a smaller SatO2/FiO2 (mean (SD): 233.0 (84.0)) and a worse viral pneumonia mortality score (MuLBSTA) than patients in Cluster 1. The elderly patients (Cluster 3), despite having a larger number of comorbidities, were not the group with the worst prognosis: they had potentially survived the previous waves and, in this wave (July 1st-August 31st, 2021), most had received the full vaccination regimen, which could have contributed to their better prognosis.
A visualization of the data for all profiles described here, using the first two components from a Principal Components Analysis, is shown in eFig 3 of the S1 File.
The characteristics highlighted in the identified clusters of patients lead to different degrees of severity in the COVID-19 outcomes (variables not used to determine the clusters), which were compared across the clusters in each wave (Fig 1). Differences in the outcomes IVM, ICU assistance, and in-hospital mortality are obvious over all waves (see Table 3 for statistical tests). Thus, the profiles of the patients with a worse prognosis in the COVID-19 outcomes are those that correspond to the most severe clinical characteristics.

Table 3. Outcomes (in-hospital mortality, ICU assistance, and invasive mechanical ventilation) according to patient profiles defined by clustering in each wave. Sample sizes and proportions for each cluster are indicated, frequencies are shown in each cell, and the figures in parentheses are the % of the outcome for that cluster.

Discussion
This study identified grouping structures of two (waves 1 and 3) and three (waves 2 and 4) hospitalized COVID-19 patient profiles across the four waves using clinical and demographic information and a clustering approach, which allowed learning from the data without any prior hypothesis. In our view, patient profiles can be clearly divided between waves 1-3 and wave 4. Some profiles were very close in their descriptions across the first three waves. In each of them, the clustering approach identified a group of hospitalized patients who had the worst rates of COVID-19 outcomes and were characterized as elderly obese men with a medium/large number of comorbidities and poor vital signs and severity scores. Generally speaking, the other clusters found in these first three waves are distinguished by the number of comorbidities and age. The number of vaccination doses received is crucial to discriminate groups in wave 4: Clusters 1 and 2 had a partial or full vaccination rate of 22-25% vs. 92% in Cluster 3. Moreover, age is probably the most important variable to differentiate between the profiles of the first two clusters, Cluster 1 (median [IQR]: 43 [34,55]) and Cluster 2 (median [IQR]: 52 [39,65]), as the percentage of full-regimen vaccination was similar (15-16%). As explained in the results section, our hypothesis is that Cluster 3 in wave 4 comprises elderly people who potentially survived the first three waves.
We observed that the COVID-19 patient profiles we identified exhibited varying degrees of COVID-19 outcomes, including in-hospital mortality, ICU assistance, and invasive mechanical ventilation. Interestingly, these outcomes were not used in the process of determining the clusters. We believe this may be because the selection of relevant variables for the clustering process was performed using a supervised method based on those COVID-19 outcomes. This finding has significant implications, as it can provide insight into the patient's likely prognosis based on the clinical and sociodemographic characteristics observed at baseline. Additionally, although the patient's ceiling of care is a variable that must be interpreted with caution since it can change between waves and centers, Fig 1 shows that patients with a ceiling of care had higher in-hospital mortality than patients without one. For a more extensive discussion of this topic, we suggest reading Pallarés et al. [27], which deals with the same group of patients.
There were differences in treatments received during hospital admission across waves. For example, treatments used in the first wave, such as hydroxychloroquine and lopinavir/ritonavir, were found to be ineffective and were replaced by others like remdesivir and corticosteroids. This is reflected in the improvement of COVID-19 outcomes. For instance, we can see in Table 3 that the percentages of in-hospital mortality are 20.0% (wave 1), 16.5% (wave 2), 15.6% (wave 3), and 10.6% (wave 4). The last wave is also influenced by the massive incorporation of COVID-19 vaccines. As a result, the clustering methodology for wave 4 identified a group of 346 patients (Table 2, Cluster 1; 44% of the total sample) with no in-hospital mortality, ICU assistance, or IVM (Fig 1).
This study emphasized the heterogeneity among hospitalized COVID-19 patients and how it could be associated with different degrees of severity of the studied disease outcomes. Some of the strengths of this study are its multicenter framework, which provided a large number of patients covering a region, and the inclusion of a vast number of demographic and clinical features. However, several limitations must be noted: 1) This research did not intend to explain causality, only statistical associations and descriptions. Our results were obtained from the observational data available. Ideally, we would have determined the clusters in a randomized design, but such a design is impossible to pursue in the current setting; 2) the study cohort consisted exclusively of hospitalized COVID-19 patients in the Barcelona metropolitan south region and did not include outpatients, which can cause a selection bias. Thus, our results cannot be generalized to the general population; 3) our clustering approach focused on complete data. To address this limitation, we statistically compared the profiles of patients with complete records with those of patients with incomplete records, and obtained equivalent clinical and sociodemographic profiles. However, handling missingness, for instance using a multiple imputation strategy, is a possible and robust strategy to follow in future studies; 4) in contrast to predictive modelling, there is currently no single gold standard for statistical validation of data clustering results. To overcome this restriction, we assessed validity within three domains [19]: face validity (the clustering structure obtained in each wave was recognizable by the clinical team and was reasonable and coherent given their expertise), construct validity (the clustering structures resulting from two unrelated clustering approaches, k-means and KAMILA, are equivalent), and criterion validity (the different patient groups have significantly different disease outcomes); 5) the methodologies applied in this study may not fully capture the trends and patterns of the evolving COVID-19 pandemic, since they are time-limited; future research over a longer period of time would be needed.
Further research should also consider determining the profiles stratified by ceiling of care, or adding information about outpatients to avoid selection bias. Additionally, our last general consideration is that statistical methods such as the ones applied here can potentially reveal hidden patterns in the data, but reaching consensus on the results with clinical teams is imperative to give them clinical context and a sensible interpretation. We believe that our research has the potential to serve as a guideline for epidemiologists worldwide, extending its applicability beyond Spain.

Conclusions
Our study suggests that a single care model at hospital admission may not meet the needs of hospitalized COVID-19 adults, who differ in age, obesity, oxygen-related vital signs, comorbidity burden, and pneumonia severity scores. A clustering approach appears appropriate to help physicians differentiate patients based on their profile and possible disease outcomes and, thus, apply multiple care intervention strategies, as another way of responding to new outbreaks of this or future diseases.

Fig 1. Bar plots of the % of the outcome (in-hospital mortality, ICU assistance, and invasive mechanical ventilation) and % of ceiling of care broken down by cluster. https://doi.org/10.1371/journal.pone.0302461.g001

Table 1. Comparison of demographic characteristics, comorbidities, vital signs and severity scores at hospital admission between clusters for wave 1 (March 1st-April 15th, 2020).
The table only shows the statistically significant variables.

Table 2. Comparison of demographic characteristics, comorbidities, vital signs and severity scores at hospital admission between clusters for wave 4 (July 1st-August 31st, 2021).
The table only shows the statistically significant variables.